(54) Apparatus and methods for sharing idle workstations



(57) The present invention relates to systems for
sharing idle workstation computers that are connected
together through a network and shared file system.
More particularly, a user of a local host workstation may
submit jobs for execution on remote workstations. The
systems of the present invention select a remote host
that is idle in accordance with a decentralized scheduling
scheme and then continuously monitor the activity
of the remote host on which the job is executing. If the
system detects certain activity on the remote host by
one of the remote host's primary users, the execution of
the job is immediately suspended to prevent inconvenience
to the primary users. The system also suspends
job execution if the remote host's load average gets too
high. Either way, the suspended job is migrated by selecting
another idle remote workstation to resume execution
of the suspended job (from the point in time at
which the last checkpoint occurred).
Description

Cross Reference to Related Applications 

The present invention is related to the following International
Patent Applications: "Persistent State
Checkpoint And Restoration Systems," PCT Patent Application
No. PCT/US95/07629 and "Checkpoint And
Restoration Systems For Execution Control," PCT Patent
Application No. PCT/US95/07660, both of which are
assigned to the assignee of the present invention and
both of which are incorporated herein by reference. Additionally,
the following articles are also incorporated
herein by reference: Y.M. Wang et al., "Checkpointing
and Its Applications," Symposium on Fault-Tolerant
Computing, 1995, and G.S. Fowler, "The Shell as a
Service," USENIX Conference Proceedings, June
1993.

Background of The Invention 

The present invention relates to networked compu- 
ter systems. In particular, the present invention relates 
to computing environments that are populated with net- 
worked workstations sharing network file systems. One 
known system for sharing idle resources on networked
computer systems is the Condor system that was developed
at the University of Wisconsin (idle resources refer,
generally, to a networked workstation having no user input
commands for at least a certain period of time). The
Condor system is more fully described in M. Litzkow et
al., "Condor - a hunter of idle workstations," Proc. ICDCS,
pp. 104-111, 1988.

Some of the main disadvantages of the Condor system
are as follows. First, Condor uses a centralized scheduler
(on a centralized server) to allocate network resources,
which is a potential security risk. For example, having a
centralized server requires that the server have root
privileges in order for it to create processes that impersonate
the individual users who submit jobs to it for remote
execution. Any lack of correctness in the coding
of the centralized server (e.g., any "bugs") may allow
someone other than the authorized user to gain access
to other users' privileges. Secondly, the Condor system
migrates the execution of a job on a remote host as soon
as it detects any mouse or keyboard activity (i.e., user
inputs) on the remote host. Therefore, if any user, including
users other than the primary user, begins using
the remote host, shared execution is terminated and the
task is migrated. This causes needless migration and
its concomitant loss of work. Thirdly, due to the nature
of a centralized server, starting the server can only be
accomplished by someone with root privileges.

Another known system for accomplishing resource
sharing is the Coshell system (which is described more
fully in the G.S. Fowler article "The Shell as a Service,"
incorporated by reference above). Coshell, unfortunately,
also has disadvantages, such as the fact that Coshell
also suspends a job on the remote host whenever the
remote host has any mouse or keyboard activity. Additionally,
Coshell cannot migrate a suspended job (a job
that was executing remotely on a workstation and was
suspended upon a user input at the remote workstation)
to another machine, but has to wait until the mouse or
keyboard activity ends on the remote host before resuming
execution of the job. The fact that Coshell suspends
the job's execution in response to any mouse or keyboard
activity, regardless of which user causes it, creates
needless suspensions of the job.

It would therefore be desirable to provide systems
and methods of more efficiently sharing computational
resources between networked workstations.

It would also be desirable to provide program execution
on idle, remote workstations in which suspension
of the program execution is reduced to further increase
processing efficiency.

It would be still further desirable to provide a network
resource sharing architecture in which suspended
jobs may be efficiently migrated to alternate idle workstations
when the initial idle, remote workstation is once
again in use by the primary user.

Summary of the Invention 

The above and other objects of the invention are
accomplished by providing methods for increasing the
efficiency in sharing computational resources in a network
architecture having multiple workstations in which
at least one of those workstations is idle. The present
invention provides routines that, once a job is executing
remotely, do not suspend that job unless the primary user
"retakes possession" of the workstation (versus any
user other than the primary user). Additionally, the
present invention provides the capability to migrate a remote
job, once it has been suspended, to another idle
workstation in an efficient manner, rather than requiring
that the job remain on the remote workstation in a suspended
state until the remote workstation becomes idle
again.

Brief Description of the Drawings 

The above and other objects of the present invention
will be apparent upon consideration of the following
detailed description, taken in conjunction with the accompanying
drawings, in which like reference characters
refer to like parts throughout, and in which:

FIG. 1 is an illustrative schematic diagram that depicts
substantially all of the processes that are typically
active when a single user, on the local host,
has a single job running on a remote host in accordance
with the principles of the present invention;
and

FIG. 2 is a table that depicts a representative sample
of an attributes file in accordance with the principles
of the present invention.

Detailed Description of The Invention 

FIG. 1 is an illustrative schematic diagram that depicts
substantially all of the processes that are typically
active when a single user (who shall be referred to as
user X), on the local host, has a single job running on a
remote host. FIG. 1 is divided into halves 100 and 101,
where half 100 depicts processes 102-105 running on
the local host 100, and half 101 depicts processes
106-110 running on a remote host 101. Thus, user X on
local host 100 (i.e., the left half of FIG. 1) has a single
job 109 running on a remote host 101 (i.e., the right half
of FIG. 1).

Hosts 100 and 101 are connected together through 
a network and share a file system. While FIG. 1 illustrates
a network having only two hosts, it should be understood
that the techniques of the present invention will
typically be used in computing environments where 
there are multiple hosts, all being networked together, 
either directly together or in groups of networks (e.g., 
multiple local area networks, or LANs, or through a wide 
area network, or WAN), and sharing a file system. 

The process by which a user, such as user X, sub- 
mits a job for execution on a remote host is as follows, 
as illustrated by reference to FIG. 1. The following example
assumes a point in time at which none of the processes
described in FIG. 1 have been started. In this illustrative
example, the principles of the present invention
are applied to the Coshell system that was described
and incorporated by reference above.

Initially, a user starts a coshell process (a coshell 
process is a process that automatically executes shell 
actions on lightly loaded hosts in a local network) as a 
daemon process (a process that, once started by a user 
on a workstation, does not terminate unless the user terminates
it) on the user's local host. Thus, in FIG. 1, user
X starts coshell process 104 as a daemon process on 
local host 100. Every coshell daemon process serves 
only the user who has started it. For example, while an- 
other user Y may log onto host 100, only user X can 
communicate with coshell process 104. 

Upon being started, a coshell daemon process
starts a status daemon (unless a status daemon is already
running) on its own local host, and on every other
remote host to which the coshell may submit a job over
the network. In the case of FIG. 1, coshell 104 has started
status daemon 105 running on host 100 and status
daemon 110 running on host 101.

Each status daemon collects certain current information
regarding the host it is running on (i.e., status
daemon 105 collects information about host 100, while
status daemon 110 collects information about host 101)
and posts this information to a unique status file which
is visible to all the other networked hosts. This current
information comprises: (i) the one minute load average
of the host as determined by the Unix command uptime,
and (ii) the time elapsed since the last activity on an input
device, either directly or through remote login, by a primary
user of the host. All of the status files of all the
status daemons are kept in a central directory. A central
directory is used to reduce the network traffic that would
otherwise occur if, for example, every status daemon
had to post its current information to every other host on
the network. In particular, a central directory of status
files causes the network traffic to scale linearly with the
number of hosts added. Furthermore, each status daemon
posts its current information every 40 seconds plus
a random fraction of 10 seconds; the random fraction
is intended to further relieve network congestion by temporally
distributing the posting of current information. A
coshell uses these status files to select an appropriate
remote host upon which to run a job submitted to it by
the coshell's user.
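
As a rough illustration of the posting schedule described above, the following Python sketch computes the delay between posts and writes one host's current information into a shared directory. The one-file-per-host JSON layout, the ".status" suffix, and the function names are assumptions made for illustration; the description specifies only what is posted and how often.

```python
import json
import os
import random

POST_PERIOD_S = 40.0  # base posting period from the description
JITTER_S = 10.0       # upper bound of the random fraction


def next_post_delay(rng=random):
    """Seconds until the next post: 40 s plus a random fraction of 10 s."""
    return POST_PERIOD_S + rng.random() * JITTER_S


def post_status(status_dir, host, load_1min, idle_seconds):
    """Write one host's current information to its own status file.

    The JSON record and the file naming are hypothetical.
    """
    record = {"host": host, "load_1min": load_1min, "idle_seconds": idle_seconds}
    path = os.path.join(status_dir, host + ".status")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(record, f)
    # Rename into place so a reader does not see a half-written file.
    os.replace(tmp_path, path)
    return path


def read_status(status_dir, host):
    """Read back the record that a coshell would consult for this host."""
    with open(os.path.join(status_dir, host + ".status")) as f:
        return json.load(f)
```

Because every daemon writes only its own file, adding a host adds one writer to the central directory, which is consistent with the linear scaling noted above.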

Each coshell selects a remote host independently
of all of the other coshells that may be running and is
therefore said to implement a form of decentralized
scheduling. The coshells avoid the possibility of all submitting
their jobs to the same remote host by having a
random smoothing component in their remote host selection
algorithm. Rather than choosing a single best remote
host, each coshell chooses a set of candidate remote
hosts and then picks an individual host from within
that set randomly.
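
The random smoothing step can be sketched in a few lines of Python. The uniform choice and the function name are assumptions; the description states only that a host is picked randomly from the candidate set.

```python
import random


def choose_remote_host(candidates, rng=random):
    """Pick one host at random from the candidate set.

    Returns None when the set is empty; a caller would then have to
    wait until some host qualifies. Uniform selection is an assumption.
    """
    if not candidates:
        return None
    # Sort first so the result depends only on the set and the generator.
    return rng.choice(sorted(candidates))
```

Since each coshell draws independently, several coshells that see the same candidate set tend to spread their jobs across it rather than all choosing the same host.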

A coshell chooses a set of candidate remote hosts
based upon the current information in the status files
and the static information in an attributes file. There is
a single attributes file shared by all the coshells. For
each host capable of being shared by the present invention,
the attributes file typically stores the following attributes:
type, rating, idle, mem and puser. An illustrative
example of an attributes file is shown in the table of FIG.
2. Each line of FIG. 2 contains, from left to right, the host
name followed by assignments of values to each of the
attributes for that host.
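
A minimal parsing sketch for one line of such an attributes file follows. FIG. 2 is not reproduced in this text, so the exact syntax ("host attr=value ...", with a quoted list for several primary users) is an assumption, as is the function name.

```python
import shlex


def parse_attributes_line(line):
    """Parse 'host attr=value ...' into (host, {attr: value}).

    shlex.split keeps a quoted value such as puser="emerald ymwang"
    together as a single token. The line syntax itself is assumed.
    """
    tokens = shlex.split(line)
    host, assignments = tokens[0], tokens[1:]
    attrs = {}
    for token in assignments:
        name, _, value = token.partition("=")
        attrs[name] = value
    # Several primary users are separated by spaces inside the quotes.
    attrs["puser"] = attrs.get("puser", "").split()
    return host, attrs
```

The values are left as strings here, since the units used in the real file (for example, for the idle attribute) are not given in the text.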

The type attribute differentiates between host
types. For example, the "sgi.mips" in line 1 of FIG. 2 indicates
a host of the Silicon Graphics workstation type,
while "sun4" indicates a Sun Microsystems workstation
type. The rating attribute specifies the speed for a host
in MIPS (millions of instructions per second). All hosts
have preferably had their MIPS rating evaluated according
to a common benchmark program. This ensures the
validity of MIPS comparisons between hosts. The idle
attribute specifies the minimum time which must elapse
since the last primary user activity on a host (as posted
in the host's status file) before the present invention can
use the host as a remote host. The mem attribute describes
the size (in megabytes) of the main memory of
the host, while the puser attribute describes the primary
users of the host. For example, host "banana" of FIG. 2
has two primary users indicated by the quoted list "emerald
ymwang," while host "orange" only has the primary
user "ruby" (no quotes are necessary when there is only
one primary user).

In general, the primary users attribute of a host is
simply a subset of the universe of potential users of the
host, selected as primary because they should be accorded
extra priority in having access to the host's resources.

Typically, if the host is physically located in a partic- 
ular person's office, then the primary users of that host 
would include the occupant of that office as well as any 
administrative assistants to the occupant. For certain 
computer installations, it is possible to ascertain the oc- 
cupant of the office containing the host from the name 
of the home directory on the host's physically attached
disk. The names of these home directories can be col- 
lected automatically, over the network, thereby making 
it easier to create and maintain an attributes file. 

In addition to, or as an alternative to, the occupant 
of an office containing the host, the primary users of a 
host may include anyone who logs into the host from the
host's console (where the console comprises the key- 
board and video screen which are physically attached 
to the host). 

Included among any other qualifications required of
a remote host for it to be included in the set of candidate
remote hosts chosen by a coshell is the requirement
that the remote host be idle. A remote host is considered
idle if all of the following three conditions are satisfied:
(i) the one minute load average (as posted in the host's
status file) is less than a threshold (with the threshold
typically being approximately 0.5); (ii) the time elapsed
since the last primary user activity (as posted in the
host's status file) is at least the period of time specified
for the host by the idle attribute in the attributes file
(where the period of time is typically approximately 15
minutes); and (iii) no nontrivial jobs are being run by a
primary user of the remote host.
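
The three-part idleness test can be sketched as a single predicate. This sketch treats the idle attribute as the minimum quiet time that must have elapsed, in line with how the attribute is defined earlier in the description; the class and field names are hypothetical.

```python
from dataclasses import dataclass

LOAD_THRESHOLD = 0.5           # "typically being approximately 0.5"
DEFAULT_MIN_IDLE_S = 15 * 60   # "typically approximately 15 minutes"


@dataclass
class HostStatus:
    load_1min: float         # one minute load average from the status file
    idle_seconds: float      # time since the last primary user input activity
    primary_user_busy: bool  # True if a primary user runs a nontrivial job


def is_idle(status, min_idle_seconds=DEFAULT_MIN_IDLE_S,
            load_threshold=LOAD_THRESHOLD):
    """Return True only if all three idleness conditions hold."""
    return (
        status.load_1min < load_threshold
        and status.idle_seconds >= min_idle_seconds
        and not status.primary_user_busy
    )
```

A coshell would apply such a predicate to every host in the attributes file and keep the hosts that pass as its candidate set.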

A nontrivial job is typically determined as follows.
The Unix command "ps" (which means process status)
is executed periodically, with two consecutive
executions of ps separated from each other by about
one minute. An execution of the ps command returns,
for each process on the host, a flag indicating whether
or not the process was running as of the time the ps
command was executed. If for both executions of ps the
flags for a process indicate that the process was running,
then the process is nontrivial. If a process is indicated
as having stopped, for at least one of the two ps
executions, then the process is considered trivial.
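
The two-sample rule can be sketched as follows, assuming each ps execution has already been reduced to a mapping from process id to a "was running" flag. That representation, and the treatment of a process missing from one sample as trivial, are assumptions.

```python
def nontrivial_pids(first_sample, second_sample):
    """Return the pids that were running in both ps samples.

    Each sample maps pid -> True if ps reported the process as running.
    A process absent from either sample is treated as trivial (assumed).
    """
    return {
        pid
        for pid, running in first_sample.items()
        if running and second_sample.get(pid, False)
    }
```

For example, with samples `{101: True, 102: True}` and `{101: True, 102: False}`, only process 101 counts as nontrivial.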

Once a user has a coshell daemon running, the user 
can start a particular job running on a remote host by 
starting a submit program. The submit program takes 
as arguments the job's name as well as any arguments 
which the job itself requires. The job name can identify 
a system command, application program object file or a
shell script. The job name can also be an alias to any of
the items listed in the previous sentence. In FIG. 1, user
X has started submit process 102 running with the name
for job 109 as an argument. A submit process keeps running
until the job passed to it has finished executing on
a remote host. Therefore, submit process 102 keeps 
running until job 109 is finished. 

If, for example, user X wants to run another job re- 
motely, user X will have to start another submit process. 
Each such submit process keeps running until the job 
passed to it as an argument has finished executing on 
its remote host. Thus, there may be several submit proc- 
esses running at the same time for the same user on a 
given local host (e.g., on local host 100). 

The first action a submit process takes upon being 
started is to spawn a coshell client process. For exam- 
ple, submit process 102 has started client process 103 
through spawning action 113. As used throughout the 
present application, spawning refers to the general 
process by which a parent process (e.g., submit process
102) creates a duplicate child process (e.g., client process
103). In Unix, the spawning action is accomplished
by the fork system call. A client process provides an output
on the local host for the standard error and standard
output of the job executing on a remote host. A client
process also has two-way communication with its
coshell via command and status pipes. Client process
103 receives the standard error and the standard output
of job 109. In addition, client process 103 communicates
with its coshell 104 via command pipe 114 and status
pipe 115.

Once the client process has been started, its coshell 
then proceeds to select a remote host upon which to run 
the submitted job. As discussed above, each coshell se- 
lects a remote host independently of all the other 
coshells and utilizes random smoothing to prevent the 
overloading of any one remote host. If there are no idle
remote hosts for a coshell to execute a submitted job
upon, the coshell will queue the job until an idle remote
host becomes available. Coshell 104, in the example
shown in FIG. 1, has selected idle remote host 101 to
run the submitted job.

Having selected a remote host, the coshell next 
starts a shell on the remote host - unless the coshell 
already has a shell running on that remote host left over 
from a previous job submission by that coshell to that 
same remote host. Assuming that the coshell does not 
already have a shell running on the selected remote 
host, it will start one with the Unix command rsh. A 
coshell has two way communication with the shell it has 
created via a command pipe and a status pipe. In FIG. 
1, coshell 104 has started a shell ksh 106 on remote 
host 101. Coshell 104 and ksh 106 communicate via 
command pipe 116 and status pipe 120. 

It should be noted that the type of shell started on 
a remote host by a coshell, referred to as a ksh, differs 
from an ordinary Unix shell only in its ability to send its
standard output and standard error (produced by a re- 
motely submitted job) over the network and back to its 
coshell. The coshell then routes this standard output 
and standard error to the correct client process. In the 
case of FIG. 1, the standard output and standard error 
of job 109, which is running in ksh 106, is sent over the
network to coshell 104. Coshell 104 then routes this
standard output and standard error to client process 
103. 

Once a ksh is running on the selected remote host, 
a program called LDSTUB is started under the ksh. An 
LDSTUB process is started under the ksh for each job, 
from the ksh's coshell, which is to be executed on the 
selected remote host. In the case of FIG. 1, LDSTUB 
process 107 is started under ksh 106. Upon being start- 
ed, the LDSTUB process begins the execution of the 
user's job and a monitor daemon by spawning both of 
them off. Thus, LDSTUB process 107 has started monitor
daemon 108 through spawn 119 and has started user
X's job 109 through spawn 118.

A monitor daemon is a fine-grain polling process. It
checks for the following conditions on the selected remote
host: (i) any activity on an input device, either directly
or through remote login, by a primary user of the
selected remote host; (ii) any nontrivial jobs being run
by a primary user of the selected remote host; or (iii) a
one minute load average on the selected remote host
(as determined by the monitor daemon itself running the
Unix command uptime) which is equal to or greater than
a particular threshold (typically the threshold is about
3.5). A monitor daemon is a fine-grain polling process
because it frequently checks for the above three conditions,
typically checking every 10 seconds. If one or
more of the above three conditions is satisfied, the monitor
daemon sends a signal to its LDSTUB process. This
signal to its LDSTUB process starts a process, described
below, which migrates the user's job to another
remote host. In the example shown in FIG. 1, monitor
daemon 108 checks for each of the above three conditions
on remote host 101.
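
The decision a monitor daemon makes on each poll can be sketched as a small predicate. How the three inputs are measured is outside this sketch, and the function and parameter names are hypothetical.

```python
MIGRATE_LOAD_THRESHOLD = 3.5  # "typically the threshold is about 3.5"
POLL_INTERVAL_S = 10          # "typically checking every 10 seconds"


def should_migrate(primary_user_input, primary_user_nontrivial_job, load_1min,
                   load_threshold=MIGRATE_LOAD_THRESHOLD):
    """Return True if any one of the three monitored conditions holds."""
    return (
        primary_user_input
        or primary_user_nontrivial_job
        or load_1min >= load_threshold
    )
```

A monitor loop would evaluate this every `POLL_INTERVAL_S` seconds and notify its LDSTUB process the first time it returns True. Note that the load threshold used here (about 3.5) is higher than the one used when selecting an idle host (about 0.5).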

The user's job is usually run on the remote host having
been already linked with the Libckp checkpointing
library. Libckp is a user-transparent checkpointing library
for Unix applications. Libckp can be linked with a
user's program to periodically save the program state
on stable storage without requiring any modification to
the user's source code. The checkpointed program state
includes the following in order to provide a truly transparent
and consistent migration: (i) the program counter;
(ii) the stack pointer; (iii) the program stack; (iv) open
file descriptors; (v) global or static variables; (vi) the dynamically
allocated memory of the program and of the
libraries linked with the program; and (vii) all persistent
state (i.e., user files). User X's job 109, in this example,
has been linked with Libckp library 111.

All of the processes shown in FIG. 1, namely
processes 102-110, are active at this point in the job
submission process.

Up to this point in the description of the job submis- 
sion process, the focus has been on presenting the 
process by which a single job is submitted for remote 
execution by a single user. It should be noted, however, 
that a user of a local host may start a second job running
on a remote host before the first submitted job has finished
executing. This second job submission is coordinated
with the processes already running for the first job
submission as follows.

First, the user starts the submit program a second
time with the second job name as an argument (in the
same manner as described above for the first job). This
second starting of the submit program creates a second
submit process just for managing the remote execution
of the second job. The second submit process spawns
a second coshell client process (similar to client process
103) which is also just for the remote execution of
the second job. The second coshell client process, however,
communicates with the same coshell which the
first coshell client process communicates with.

The coshell then selects a remote host for
the second job to execute upon and starts a ksh on that
remote host - unless the coshell already has a ksh running
on that remote host. For example, if the coshell
chooses the same remote host upon which the first job
is executing, then the coshell will use the same ksh in
which the first job is executing. Regardless of whether
the coshell needs to create a new ksh or not, within the
selected ksh a second LDSTUB process, just for the remote
execution of the second job, is started. The second
LDSTUB process spawns the second job and a second
monitor daemon.

In general then, it can be seen that if a user has N
jobs remotely executing which have all been submitted
from the same local host, then the user will have a
unique set of submit, client, LDSTUB, monitor daemon
and job processes created for each of the N jobs submitted.
All of the N jobs will share the same coshell daemon.
Any of the N jobs executing on the same remote
host will share the same ksh.

The remainder of this description addresses the operation
of the present invention with respect to a single
remotely executing job for one user of a local host. For
any additional remotely executing jobs submitted by the
same user from the same local host, each such additional
job will have its own unique set of independent
processes which will respond to conditions in the same
manner as described below.

At the point in the job submission process when a
user's job is remotely executing, one of two main conditions
will occur: (i) one or more of the three conditions
described above as being checked for by a monitor daemon
is satisfied, causing the user's job to be migrated to
another remote host; or (ii) the remotely executing job
will finish. The process initiated, at this point in the job
submission process, by each of these two main conditions
occurring is described, in turn, below.

If the first main condition (which may be one or more
of the three conditions checked for by a monitor daemon
occurring) is satisfied, the following steps will occur in
order to migrate the user's job to another remote host.
Each of these steps will be explicated by reference to
the specific process configuration of FIG. 1.

First, the satisfaction of the first main condition
causes the monitor daemon to exit. The exiting of the
monitor daemon causes a "SIGCHLD" signal to be sent
to its parent LDSTUB process. In FIG. 1, the exiting of
monitor daemon 108, upon the satisfaction of the first
main condition, causes a "SIGCHLD" signal to be sent
to monitor daemon 108's parent process LDSTUB 107.

Second, the LDSTUB process kills the user's job. 
In FIG. 1, LDSTUB process 107 kills user X's job 109. 
If the user's job has been linked with Libckp, as is the 
case for job 109, a subsequent restart of the job will only 
mean that work since the last checkpoint is lost. Typically,
Libckp performs a checkpoint every 30 minutes so
that only the last 30 minutes of job execution, at most,
will be lost.

The third step is as follows. There is a socket con- 
nection between an LDSTUB process and its submit 
process. Over this socket connection the LDSTUB proc- 
ess sends a message to its submit process telling it to 
have the user's job run on another remote host. LDSTUB
process 107 sends such a message over socket
112 to submit process 102. 

In the fourth step, the submit process kills the client
process it had originally spawned off to communicate
with the remote job through coshell. The submit process
also kills its LDSTUB process. In the case of FIG. 1, submit
process 102 kills client process 103 and LDSTUB
process 107.

In the fifth step, the submit process resubmits the 
user's job to another remote host according to the sub- 
mission process described above. Submit process 102 
resubmits job 109 to coshell 104 for execution on an- 
other remote host. The subsequent processes and the 
other remote host which would be a part of job 109's 
resubmission are not shown in FIG. 1. When the user's 
job is executed on another remote host, the first action 
of the job is to check for a checkpoint file. If the job finds 
a checkpoint file it restores the state of the job as of the 
point in time of the last checkpoint. 
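
The restart behavior (look for a checkpoint file, restore from it if present, otherwise start fresh) can be illustrated with an application-level sketch in Python. This is not how Libckp works: Libckp saves the process state transparently, whereas this sketch pickles an explicit state dictionary. The file handling, function names, and the toy job are all assumptions.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, initial_state):
    """Return the saved state if a checkpoint file exists, else initial_state."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return initial_state


def run_job(path, total_steps, checkpoint_every, stop_after=None):
    """Toy job: sum 0..total_steps-1, checkpointing every few steps.

    stop_after simulates the job being killed after that many steps.
    """
    state = load_checkpoint(path, {"next_step": 0, "total": 0})
    executed = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and executed == stop_after:
            return None  # simulated kill: work since the last checkpoint is lost
        state["total"] += state["next_step"]
        state["next_step"] += 1
        executed += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Running the job once with `stop_after` set and then again without it resumes from the last saved step, so only the steps after that checkpoint are repeated. With a shared file system, the second run could take place on a different host.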

It is important to note that the ksh under which the 
user's job had been running, along with its pipe connections
to its coshell, is kept running. This ksh may be used
again later by its coshell provided that the remote host
upon which the ksh is running becomes idle again by 
satisfying the three conditions described above for a 
candidate remote host. In the case of FIG. 1, ksh 106 is 
kept running along with its command pipe 116 and sta- 
tus pipe 120 connections to coshell 104. 

Alternatively, if the second main condition de- 
scribed above (of the remotely executing job finishing) 
is satisfied, the following steps will occur. As with the 
first main condition, each of these steps will be explicated
by reference to the specific process configuration of
FIG. 1.

First, when the user's job finishes execution on a
remote host, it exits causing a "SIGCHLD" signal to be
sent to its parent process (its LDSTUB process). In response
to receiving the "SIGCHLD" signal, LDSTUB obtains
the exit status number (a status number is returned
by every exiting Unix process to indicate its completion
status) of the user's job with the Unix system call
wait(2). When user X's job 109 finishes execution on remote
host 101 it exits causing a "SIGCHLD" signal to be sent
to its parent LDSTUB process 107. LDSTUB process
107 then obtains job 109's exit status number with the
wait(2) system call.

Second, the LDSTUB process kills its monitor daemon.
LDSTUB process 107 kills monitor daemon 108.

The third step is as follows. There is a socket connection
between the LDSTUB process and its submit
process. Over this socket connection the LDSTUB process
sends a message to its submit process telling it that
the user's job has finished and containing the status
number returned by the user's job. LDSTUB process
107 sends such a message over socket 112 to submit
process 102.

In the fourth step, the submit process kills the client
process it had originally spawned off to communicate
with the remote job through coshell. The coshell then
detects that a particular client process has been killed
by a particular signal number. In response to detecting
this, the coshell sends a message to the appropriate ksh.
The message sent by the coshell instructs the ksh to kill
the LDSTUB process, corresponding to the submit process,
using the same signal number by which the submit
process killed its client process. In the case of FIG. 1,
submit process 102 kills client process 103. Coshell 104
then detects that client process 103 has been killed by
a particular signal number. In response to detecting this,
coshell 104 sends a message to ksh 106. The message
sent by coshell 104 instructs ksh 106 to kill LDSTUB
process 107 with the same signal number by which submit
process 102 killed client process 103.

In the fifth step, the submit process exits with the 
same status number returned by the remotely executed 
job. Submit process 102 exits with the status number 
returned by the execution of job 109 on remote host 101.

As with the migration process described above, it is
important to note that the ksh under which the user's job
had been running, along with its pipe connections to its
coshell, is kept running. This ksh may be used again
later by its coshell provided that the remote host upon
which the ksh is running becomes idle again by satisfying
the three conditions described above for a candidate
remote host. In the case of FIG. 1, ksh 106 is kept running
along with its command pipe 116 and status pipe
120 connections to coshell 104.
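
The two outcomes that an LDSTUB process handles (the monitor exits first, so the job is killed and must be migrated; or the job exits first, so the monitor is killed and the exit status is reported) can be sketched in Python. The description uses the SIGCHLD signal and the wait(2) system call; this sketch substitutes simple polling of two child processes, and the function name and return values are hypothetical.

```python
import subprocess
import sys
import time


def run_stub(job_cmd, monitor_cmd, poll_s=0.05):
    """Start the job and its monitor, then react to whichever exits first.

    Returns ("finished", exit_status) if the job ends on its own, or
    ("migrate", None) if the monitor exits first and the job is killed.
    Polling stands in for SIGCHLD handling here.
    """
    job = subprocess.Popen(job_cmd)
    monitor = subprocess.Popen(monitor_cmd)
    while True:
        status = job.poll()
        if status is not None:
            # Job finished: the monitor is no longer needed.
            monitor.kill()
            monitor.wait()
            return ("finished", status)
        if monitor.poll() is not None:
            # Monitor exited: a migration condition was met, so kill the job.
            job.kill()
            job.wait()
            return ("migrate", None)
        time.sleep(poll_s)
```

In the arrangement described above, the returned tuple would correspond to the message sent over the socket to the submit process, which either resubmits the job or exits with the job's status number.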

The workstations shared through the present inven-
tion can be all of one type (homogeneous workstations).
An example is a network where only Sun Microsystems
workstations can be shared. Alternatively, the present
invention can be used to share a variety of workstation
types (heterogeneous workstations). An example is a
network where both Sun Microsystems workstations
and Silicon Graphics workstations can be shared.
Where the present invention is used with heteroge-
neous workstations, the migration of a job, based upon
workstation type, is typically as follows. Where no type
is specified upon submitting a job, the job is limited to
migrating among workstations of the same type as the
one from which it was submitted. The user can specify,
however, upon submitting a job, that the job be executed
only upon workstations of one particular type, and this type
need not be the same as the type of workstation from
which the job was submitted. The user can also specify,
upon submitting a job, that the job can be executed on
any workstation type.
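
The three type-selection cases can be expressed as a small filter over candidate hosts. This is a sketch under stated assumptions rather than the scheduler's actual logic: the host table, the type labels, the ANY sentinel and the function name eligible_hosts are all hypothetical, and idleness checks are omitted.

```python
ANY = "any"  # hypothetical sentinel meaning "any workstation type"


def eligible_hosts(hosts, submit_type, requested=None):
    """Return the names of hosts whose type permits running the job.

    hosts maps host name -> workstation type. requested is None when
    the user gave no type, ANY for any type, or one specific type.
    """
    if requested == ANY:
        return sorted(hosts)
    # With no type specified, fall back to the submitting host's type.
    wanted = submit_type if requested is None else requested
    return sorted(name for name, kind in hosts.items() if kind == wanted)
```

For example, with hosts {"a": "sun", "b": "sgi", "c": "sun"} and a job submitted from a "sun" workstation, no type yields hosts a and c, a requested type of "sgi" yields host b, and ANY yields all three.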

Persons skilled in the art will appreciate that the
present invention may be practiced by other than the
described embodiments, which are presented for pur-
poses of illustration and not of limitation, and the present
invention is limited only by the claims which follow.



Claims 

1. A system for sharing computer resources among a 
plurality of computers connected together by a net- 
work, comprising: 

a local computer that accepts jobs for remote 
execution, the local computer being one of the
plurality of computers; 

a remote computer that is capable of receiving 
jobs for remote execution, the remote computer 
being one of the plurality of computers and ex- 
ecuting a current status process to collect cur- 
rent information that includes activity informa- 
tion about primary users of the remote compu- 
ter; and 

a scheduling computer that chooses a location 
for remote execution of the accepted jobs 
based upon at least the primary user current 
information regarding the remote computer, the 
scheduling computer being one of the plurality 
of computers. 

2. The system of claim 1, wherein the scheduling com-
puter and the local computer are the same compu-
ter.

3. The system of claim 1, wherein the scheduling com-
puter and the remote computer are the same com-
puter.

4. The system of claim 1, wherein the local computer,
the remote computer and the scheduling computer 
are three different computers of the plurality of com- 
puters. 

5. The system of claim 1, wherein the scheduling com-
puter capability of choosing a location for remote
execution is executed as a decentralized schedul- 
ing process among the plurality of computers. 



6. The system of claim 1, wherein the local computer
executes a submission process in response to a job 
being submitted for remote execution by a user of 
the local computer. 

7. The system of claim 1, wherein the remote compu-
ter also executes:

a monitor process that monitors the activity of 
primary users of the remote computer. 

8. The system of claim 1, wherein the activity informa-
tion includes time elapsed since a last activity on an
input device by a primary user of the remote com- 
puter. 

9. The system of claim 8, wherein the input device is 
directly connected to the remote computer. 

10. The system of claim 8, wherein the input device is 
connected to the remote computer through a re-
mote login process.

11. The system of claim 1, wherein the scheduling com-
puter chooses a remote computer as the computer
to remotely execute a job only if the time elapsed
since a last activity on the input device by a primary 
user of the remote computer is greater than a pre- 
determined threshold. 

12. The system of claim 1, further comprising a file sys-
tem shared by at least the local and remote com-
puters.

13. The system of claim 12, wherein a primary user of 
the remote computer is determined by attribute in-
formation stored in a central location in the file sys-
tem.

14. The system of claim 13, wherein the scheduling
computer chooses a location based on current in-
formation kept in the file system and the current sta-
tus process on each remote computer posts the cur-
rent information it collects to the central location in
the file system.

15. A system for sharing computer resources among a 
plurality of computers connected together by a net- 
work, comprising: 

a local computer that accepts jobs that a user
has submitted to the local computer for execu-
tion on a remote computer; and

a remote computer, selected by a scheduling
process, that executes the job submitted by the
user, and executes a monitor process that
causes the execution of the job to be suspend-
ed if the monitor process detects activity on the
remote computer by a primary user of the re-
mote computer.

16. The system of claim 15, wherein the remote com- 
puter suspends execution of the job if activity by a 
primary user of the remote computer is detected on 
an input device of the remote computer. 

17. The system of claim 15, wherein the local computer
executes a submission process such that if the job
is suspended by the remote computer, the job is re-
submitted by the submission process to the sched-
uling process for execution on a second remote
computer.

18. The system of claim 15, wherein the monitor proc-
ess is spawned from a parent process.

19. The system of claim 15, wherein the monitor proc-
ess is a fine-grain polling process.

20. The system of claim 15, wherein the job is linked
with a checkpointing library. 

21. The system of claim 15, further comprising a file
system shared by at least the local and remote com- 
puters. 

22. The system of claim 21, wherein a primary user of
the remote computer is determined by attribute in-
formation stored in a central location in the file sys-
tem.

23. The system of claim 15, wherein the scheduling 
process is a decentralized scheduling process. 

24. The system of claim 23, wherein the decentralized 
scheduling process is executed on the local com- 
puter that submits a job for remote execution. 

25. The system of claim 15, wherein the scheduling 
process is executed by a single processor for the 
entire network. 

26. A method of sharing computer resources among a 
plurality of computers connected together by a net- 
work comprising the steps of: 

accepting jobs for remote execution on a first 
computer; 

executing a scheduling process to schedule the 
remote execution of the accepted jobs based 
on current information about each of the plural- 
ity of computers that may be remotely ac- 
cessed, the current information including activ- 
ity information about primary users of each of 
the remotely accessible computers; and

running a current status process on at least one
second computer of the plurality of computers,
the second computer being one of the remotely
accessible computers, the current status proc-
ess operating to collect current information
about the second computer.

27. The method of claim 26, further comprising the step 
of: 

executing a submission process on the first
computer in response to a job being submitted by a
user of the first computer.

28. The method of claim 26, wherein the step of running 
a current status process collects activity information 
that includes time elapsed since a last activity on
an input device by a primary user of the second
computer. 

29. The method of claim 26, wherein the step of exe-
cuting a scheduling process chooses a remote com-
puter for remote execution of the submitted job only
if the time elapsed since a last activity on an input
device by a primary user of the remote computer is
greater than a predetermined threshold, the remote
computer being one of the second computers.

30. The method of claim 29, further comprising the
steps of:

executing one of the accepted jobs on the re-
mote computer chosen by the scheduling proc-
ess; and

monitoring the remote computer for activity by 
a primary user of the remote computer. 

31. The method of claim 30, further comprising the step
of: 

suspending the remote execution of the job 
when activity by a primary user is sensed during the 
step of monitoring. 

32. The method of claim 31, further comprising the step
of: 

resubmitting the suspended job to the sched- 
uling process for execution on a different remote 
computer when activity by a primary user is sensed
during the step of monitoring. 

33. The method of claim 30, further comprising the step 
of: 

linking the job to a checkpointing library prior
to executing the job.

34. The method of claim 26, wherein the step of exe-
cuting a scheduling process is executed as a de-
centralized scheduling process.

35. The method of claim 26, wherein the step of exe-
cuting a scheduling process is executed as a cen-
tralized scheduling process on a single processor.